Abstract



Many collective human activities have been shown to exhibit universal
patterns. However, the possibility of universal patterns across timing
events of researcher migration has barely been explored at global
scale. Here, we show that timing events of migration within different
countries exhibit remarkable similarities. Specifically, we look at the
distribution governing the data of researcher migration inferred from
the web. Compiling the data in itself represents a significant advance
in the field of quantitative analysis of migration patterns. Official
and commercial records are often access restricted, incompatible
between countries, and especially not registered across researchers.
Instead, we introduce GeoDBLP where we propagate geographical seed
locations retrieved from the web across the DBLP database of 1,080,958
authors and 1,894,758 papers. But perhaps more important is that we are
able to find statistical patterns and create models that explain the
migration of researchers. For instance, we show that the science job
market can be treated as a Poisson process with individual propensities
to migrate following a log-normal distribution over the researcher's
career stage. That is, although jobs enter the market constantly,
researchers are generally not "memoryless" but have to care greatly
about their next move. The propensity to make k > 1 migrations,
however, follows a gamma distribution suggesting that migration at
later career stages is "memoryless". This aligns well but actually goes
beyond scientometric models typically postulated based on small case
studies. On a very large, transnational scale, we establish the first
general regularities that should have major implications on strategies
for education and research worldwide.



1 Introduction 

Over the last years, many collective human activities have been shown to
exhibit universal patterns, see e.g. [33, 12] among others. However, the
possibility of universal patterns across timing events of researcher
migration (the event of transfer from one residential location to another
by a researcher) has barely been explored at global scale. This is
surprising since education and science are, and have always been,
international. For instance, according to the UNESCO Institute for
Statistics, the global number of foreign students pursuing tertiary
education abroad increased from 1.6 million in 1999 to 2.8 million in
2008 (footnote 1). As the UN notes [30], "there has been an expansion of
arrangements whereby universities from high-income countries either
partner with universities in developing countries or establish branch
campuses there. Governments have supported or encouraged these
arrangements, hoping to improve training opportunities for their citizens
in the region and to attract qualified foreign students."
Likewise, science thrives on the free exchange of findings and methods, and
ultimately of the researchers themselves, as noted by the German Council
of Science and Humanities [9]. The European Union even defined the free
movement of knowledge in Europe as the "fifth fundamental freedom"
(footnote 2). Similarly, the US National Science Foundation argues that
"international high-skill migration is likely to have a positive effect on
global incentives for human capital investment. It increases the
opportunities for highly skilled workers both by providing the option to
search for a job across borders and by encouraging the growth of new
knowledge" [25]. Generally, due to globalization and rapidly increasing
international competition, today's scientific, social and ecological
challenges can only be met on a global scale both in education and science,
and are accompanied by political and economic interests. Thus, research on
the migration of scientists, and understanding it, plays a key role in the
future development of most computer science departments, research
institutes, companies and nations, especially if fertility continues to
decline globally [16]. But can we provide decision makers and analysts
with statistical regularities of migration? Are there any statistical
patterns at all?

1 United Nations Educational, Scientific and Cultural Organization, Data
extract (Paris, 2011), accessed on 19 April 2011 at:
http://stats.uis.unesco.org/unesco/TableViewer

2 Council of the European Union (2008a), p. 5: "In order to become a truly modern
and competitive economy, and building on the work carried out on the future of science
and technology and on the modernization of universities, Member States and the EU must
remove barriers to the free movement of knowledge by creating a 'fifth freedom' ..."

These questions were the seed that grew into the present report. At
first sight, reasons to migrate are manifold and complex: political stability
and freedom of science, family influences such as long-distance relationships
and overseas relatives, and personal preferences such as exploration, climate,
improved career, better working conditions, among others. Despite this
complex web of interactions, we show in this paper that the timing events
of migration within different countries exhibit remarkably simple but strong
and similar regularities. Specifically, we look at the distribution governing
the data of researcher migration inferred from the web. Compiling the data
in itself represents a significant advance in the field of quantitative analysis
of migration patterns. Although efforts to produce comparable and reliable
statistics are underway, estimates of researcher flows are nonexistent,
outdated, or largely inconsistent for most countries. Moreover, official
(NSF, EU, DFG, etc.) and commercial (ISI, Springer, Google, AuthorMapper,
ArnetMiner) records are often access restricted and especially not registered
across researchers. On top of that, these information sources are often highly
noisy. Luckily, bibliographic sites on the Internet such as DBLP are publicly
accessible and contain data for millions of publications. Papers are written
virtually everywhere in the scientific world, and the affiliations of authors
tracked over time could be used as a proxy for migration. Unfortunately, many
if not most of the prominent bibliographic sites such as DBLP do not provide
affiliation information. Instead, we have to infer this information. To
do so, we extracted the geographical locations (the cities) for a few seed
author-paper-pairs and then propagated them across the DBLP social network
of more than one million authors and almost two million papers. We
refer to this new dataset as GeoDBLP, DBLP augmented with geo-tags.
GeoDBLP is the basis for our statistical analysis and has city-tags for most
of the 5,033,018 paper-author-pairs in DBLP. Specifically, as partly shown
in Fig. 1, we present the first strong regularities for researcher migration in
computer science:

• (R1) A specific researcher's propensity to migrate, that is, to make
the next move, follows a log-normal distribution. That is, researchers
are generally not "memoryless" but have to care greatly about their
next move. This is plausible due to the dominating early career
researchers with non-permanent positions. This regularity of timing
events is remarkably stable and similar within different continents and
countries across the globe.

• (R2) The propensity to make k > 1 migrations, however, follows a
gamma distribution, suggesting that migration at later career stages is
"memoryless". That is, researchers have to care less about their next
move since the majority of positions are permanent in later career
stages.

• Since jobs enter the market all the time, R1 and R2 together suggest
that the job market can be treated as a Poisson-log-normal process.

• (R3) The brain circulation, i.e., the time until a researcher returns
to the country of her first publication, follows a gamma distribution.
That is, returning is also memoryless. Researchers cannot plan to
return but rather have to pick up opportunities as they arrive.

• (R4) The inter-city migration frequency follows a power-law. That
is, cities with a high exchange of researchers will exchange even more
researchers in the future. So, investments into migration pay off.

• Statistical patterns: Link analysis of the author-migration graph
can discover additional statistical patterns such as (SP1) migration
sinks, sources and incubators, as well as (SP2) the hottest migration
cities.

Figure 1: We infer from the WWW the first strong regularities of timing
events in the migration of computer scientists. Due to the many early
stage careers with non-permanent contracts, a specific scientist's
propensity to make the next move follows a log-normal distribution (left).
For larger numbers of moves, i.e., for senior scientists, this turns into a
gamma distribution due to permanent positions (left-middle); migration
becomes memoryless. The circulation of expertise, i.e., the time until a
researcher returns to the country of her first publication, follows a gamma
distribution (middle-right). Returning is also memoryless. The inter-city
migration frequency distribution, however, follows a power-law (right).
That is, cities with a high exchange of researchers will exchange even more
researchers in the future. These regularities should have major
implications on strategies for research across the world.



These results validate, and go beyond, migration models based on small case
studies, and they do so at a very large, transnational scale. Ultimately,
they can provide forecasts of (re-)migration, which can help decision makers
who actively seek the migration and the return of their researchers to reach
better decisions regarding the timing of their efforts.

Zipf [34] already investigated inter-city migration. He analyzed so-called
gravity models. These models incorporated terms measuring the masses of
each origin and destination and the distance between them and were
calibrated statistically using log-linear regression techniques. Over the
years, several modifications and alternatives have been postulated, see e.g.
[27] and the references therein. Stewart [28] reviewed the Poisson-log-normal
model for bibliometric/scientometric distributions, i.e., to characterize the
productivity of scientists. Sums of Poisson processes and other Poisson
regression models as well as ordinary least squares actually have a long
tradition within migration research, see [29] for a recent overview. All of
these approaches, however, have considered small-scale data only [21] and
have not considered researcher migration in computer science. To the best
of our knowledge, the only large-scale migration study was recently
presented by Zagheni and Weber [32], who analyzed a large-scale e-mail
dataset to estimate international migration rates, but not specific to
computer scientists. Moreover, they have neither presented any statistical
regularities nor dealt with missing information. Indeed, as already
mentioned, other collective human activities have been the subject of
extensive and large-scale planetary mining. Prominent examples are mobility
patterns drawn from communication and web services [22], as well as mining
blog dynamics [10] and social ties [31]. Our methods and findings complement
these results by highlighting the value of using the World Wide Web together
with data mining to deal with missing information as a world-wide lens onto
researcher migration, enabling the analyst to develop global strategies for
research migration and to inform the public debate.

We proceed as follows. We start by discussing the harvesting of our 
data in detail. Then, we will describe how we made use of multi-label 
propagation to fill in missing information. Before concluding, we will present 
our statistical migration models and patterns. 

2 Mining the Data from the Web 

Bibliographic sites on the Internet such as DBLP are publicly accessible
and contain millions of data records on publications. Papers are written
virtually everywhere in the scientific world, and the affiliations of authors
tracked over time could be used as a proxy for migration. Unfortunately,
many if not most of the prominent bibliographic sites such as DBLP do not
provide affiliation information. Instead, we have to infer this information.
In this section, we will detail the mining of our data. The goal was to tag
each of the more than 5 million author-paper-pairs in our database with an
affiliation. The data collection relied on an open-source information
extraction methodology built on DBLP, the ACM Digital Library, Google's
Geocoding API, and large-scale multi-label propagation.

Figure 2: Statistics of the DBLP dump: the number of publications (2a) and
authors (2b) per year. As one can see, DBLP has been growing constantly
over the past decades from 1970 until 2010.



2.1 Harvesting the Data 



We used DBLP as a starting point. DBLP is a large index of computer
science publications that also offers a manual best-effort entity
disambiguation [19]. We used an XML dump from February 2012 which contained
1,894,758 publications written by 1,080,958 authors. Fig. 2 shows the
number of publications and authors per year from this dump. As one can see,
the number of computer scientists as well as their productivity have been
growing enormously over the past decades. Unfortunately, DBLP does not
provide affiliation information for the authors over the years. This
information, however, is required in order to develop migration models using
author affiliations as a proxy. Specifically, we aim to infer geo-tags of the
more than 5 million unknown author-paper-pairs.

Luckily, there are other information sources on the web that contain such
information. One of these systems is the ACM Digital Library. Unfortunately,
ACM DL does not allow a full download of the data. Consequently,
we retrieved the affiliation information of only a few papers from ACM DL,
which we then had to match with our DBLP dump. This resulted in
affiliation information for 479,258 of all author-paper-pairs. In order to
fill in the missing information, we resorted to data mining techniques. To do
so, however, we have to be a little bit more careful. First, the names of the
affiliations in ACM DL are not in canonical form, which results in a very
large set of affiliation candidates. More precisely, the DBLP dump enhanced
with the initial affiliations from ACM DL contained 159,068 different
affiliation names in total. Second, although we now have partial affiliation
information, we still lack exact geo-information of the organizations to
identify cities, countries, and continents. Many of the affiliation names may
contain a reference to the city or country, but these pieces of information
are not trivial to extract from the raw strings. Additionally, we want to
have latitude and longitude values to enable further analysis and
visualization. For example, latitude and longitude data would allow one to
calculate exact distances between collaborators. This geo-location issue can
easily be resolved using Google's Geocoding API. Just querying the API using
the retrieved affiliation names resulted in geo-locations for 117,942 of the
159,068 strings. The remaining gap arises primarily from the fact that the
Google API does not find geo-locations for all the retrieved affiliation
strings. This is essentially because the strings contain information not
related to the geo-location, such as departments and e-mail addresses, among
others. In any case, as our empirical results will show, this resulted in
enough information to propagate the seed affiliations and in turn the
geo-locations across the DBLP network of authors and papers.
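
As a small illustration of the distance computation mentioned above, the
following sketch computes the great-circle distance between two geo-coded
affiliations with the haversine formula. The function name and the mean
Earth radius of 6371 km are assumptions made for illustration; they are not
taken from the GeoDBLP pipeline.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

Given the latitude and longitude of two cities, `haversine_km` returns their
distance along the Earth's surface, which would be one way to measure how far
apart two collaborators are.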

2.2 Inferring Missing Data 

Before we infer the missing author-paper-pairs, we revise the affiliation
data obtained so far. To further increase the quality of our harvested
affiliations, we hypothesized that there are actually not that many relevant
organizations in computer science and that these names need to be
de-duplicated. This hypothesis is confirmed by services such as MS Academic
Search, which currently lists only 13,276 organizations compared to our
150k+ names. Since we now have the geo-locations for many of the affiliation
strings, we can use this information for a simple entity resolution which
helps to resolve this issue. More precisely, we clustered together
affiliations for which the retrieved cities coincide, resulting in 4,254
distinct cities (see footnote 7).

Figure 3: Example database.
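
The city-based entity resolution described above can be sketched as a simple
grouping step. This is a minimal sketch under stated assumptions: the
function name, the input format (a mapping from affiliation string to city),
and the example strings are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def cluster_by_city(geocoded):
    """Group raw affiliation strings by the city retrieved for them.

    `geocoded` maps an affiliation string to a city name, or to None
    when no geo-location was found for that string.
    """
    clusters = defaultdict(list)
    for affiliation, city in geocoded.items():
        if city is not None:  # strings without a geo-location stay unresolved
            clusters[city].append(affiliation)
    return dict(clusters)


# Hypothetical input; in the paper the cities come from a geocoding service.
example = {
    "MIT CSAIL": "Cambridge, MA, USA",
    "Harvard University": "Cambridge, MA, USA",
    "Dept. of CS, Univ. of Bonn": "Bonn, Germany",
    "some-lab@example.org": None,
}
clusters = cluster_by_city(example)
```

In this toy input, the two Cambridge affiliations end up in the same cluster,
which is exactly the loss of resolution within a city that the approach
accepts in exchange for simplicity.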

The city-based entity resolution resulted in a dataset with approximately
10% of the author-paper-pairs being geo-tagged. Based on these known
geo-locations, we will now fill in the missing ones. To do so, we essentially
employ Label Propagation [3, 33] (LP), a semi-supervised learning algorithm,
to propagate the known cities to the unknown author-paper-pairs based on
the similarity between the pairs. LP works on a graph-based formulation of
the problem and propagates node labels along the edges. We define the LP
graph as an undirected graph $G = (V, E)$ with nodes $V$ and edges $E$. We
have a node in $V$ for every author-paper-pair that we want to label with a
city. Every edge $e_{ij} \in E$ between two nodes $i$ and $j$ carries a
weight $w_{ij}$ that is proportional to the similarity of the nodes.

We will now explain in detail when two nodes are connected by an edge
and how the weight $w_{ij}$ for that edge is set. Intuitively, the weight of
an edge is proportional to the similarity of the nodes, and we define the
similarity of two nodes based on relations such as co-authorship between the
authors associated with the nodes. Only those nodes are connected via an
edge for which $w_{ij} > 0$. Specifically, in order to define the edges, we
considered the following functions over the set of nodes that return facts
about the nodes: author(i), paper(i), and year(i). For example, author(i)
essentially "returns" the author of an author-paper node. Based on these
functions, we can now define logic-based rules that add a rule-specific
weight $\lambda_k$ to every matching edge $e_{ij}$. Initially, we set all
weights $w_{ij}$ to zero. The first rule,

$$w_{ij} \mathrel{+}= \lambda_1 \quad \text{if } \mathrm{paper}(i) = \mathrm{paper}(j),$$

adds a weight between two nodes if the nodes belong to two authors that
co-author the paper associated with nodes i and j. The second rule,

$$w_{ij} \mathrel{+}= \lambda_2 \quad \text{if } \mathrm{author}(i) = \mathrm{author}(j) \wedge \mathrm{year}(i) = \mathrm{year}(j),$$

adds a weight whenever two nodes correspond to different publications by
the same author in the same year. Finally,

$$w_{ij} \mathrel{+}= \lambda_3 \quad \text{if } \mathrm{author}(i) = \mathrm{author}(j) \wedge \mathrm{year}(i) = \mathrm{year}(j) + 1$$

fires when the nodes belong to two publications of the same author but
written in subsequent years. This construction process is depicted in Fig. 4a
for the example publication database in Fig. 3. The example database is
missing the affiliation information for papers 4 and 5, which is denoted by
the "?" in the "Aff" column.

7 Indeed, this approach does not distinguish multiple affiliations per city,
such as MIT and Harvard. However, it is simple and effective, and, as our
empirical results show, the resolution is sufficient to establish strong
regularities in the timing events.

Figure 4: City Propagation, with panels (a) the graph for city propagation
and (b) the completed data. Missing geo-tags from the example database
(see Fig. 3) are estimated by propagating the known cities/geo-locations
across the network of authors and papers. The graph for propagating the
information (a) is constructed as follows. For each author A and paper Id
there is a node. Two nodes are connected if they are written by the same
author in the same or subsequent years or if two researchers co-author them.
The colors of nodes indicate known cities and white nodes indicate unknown
locations. As one can see (b), this significantly improves the content of our
database. The number of geo-tagged author-paper-pairs increased
significantly, showing the publication activities across the world. (Best
viewed in color)
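
The three rules above can be sketched in a few lines of code. This is a
minimal sketch, not the implementation used for GeoDBLP: the node
representation (a dict with the keys `author`, `paper`, and `year`), the
function name, and the toy nodes are assumptions, and the quadratic loop over
all node pairs is only suitable for small examples. Because the graph is
undirected, the third rule is applied whenever the two years differ by
exactly one. The rule weights are the values reported in the text.

```python
from collections import defaultdict
from itertools import combinations

# Rule weights as reported in the text: 1, 3 and 2 for rules 1, 2 and 3.
LAMBDA_1, LAMBDA_2, LAMBDA_3 = 1.0, 3.0, 2.0


def build_edge_weights(nodes):
    """Return weights w[(i, j)] with i < j for a list of author-paper nodes."""
    w = defaultdict(float)
    for i, j in combinations(range(len(nodes)), 2):
        a, b = nodes[i], nodes[j]
        if a["paper"] == b["paper"]:  # rule 1: co-authors of the same paper
            w[(i, j)] += LAMBDA_1
        if a["author"] == b["author"]:
            if a["year"] == b["year"]:  # rule 2: same author, same year
                w[(i, j)] += LAMBDA_2
            elif abs(a["year"] - b["year"]) == 1:  # rule 3: subsequent years
                w[(i, j)] += LAMBDA_3
    return dict(w)  # pairs that match no rule get no edge


# Hypothetical toy nodes, not the example database of the paper.
toy_nodes = [
    {"author": "A1", "paper": 1, "year": 2000},
    {"author": "A2", "paper": 1, "year": 2000},
    {"author": "A1", "paper": 2, "year": 2000},
    {"author": "A1", "paper": 3, "year": 2001},
    {"author": "A2", "paper": 4, "year": 2004},
]
toy_weights = build_edge_weights(toy_nodes)
```

Only node pairs that match at least one rule receive an entry, so the
resulting weight dictionary is a sparse representation of the similarity
matrix.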

Based on the constructed graph, we can now build a symmetric $(n \times n)$
similarity matrix $W$ that is used as input to LP. Essentially, LP performs
the following matrix-matrix multiplication until convergence:
$Y^{t+1} = W \cdot Y^{t}$, where $Y^{t}$ is the labels matrix. In $Y^{t}$,
row $i$ corresponds to a distribution over the possible labels for a node
$i$. In $Y^{0}$, we set a cell $Y_{ij}$ to 1 if we know that node $i$ has
label $j$. All other cells are set to 0. After every iteration, a push-back
phase clamps the rows of the known nodes in $Y^{t}$ to their original
distribution as in $Y^{0}$. This operation is repeated until convergence or
until a maximum number of iterations has been reached. At convergence, the
labels of the unknown nodes are read off the labels matrix, i.e., the label
of node $i$ is given by $y_i = \arg\max_{j} Y_{ij}$, where $j$ ranges over
the possible labels. In our context, we call this City Propagation (CP),
that is, we run LP on the graph, constructed based on logical rules, to get
a distribution over the possible cities for every unlabeled node.

Figure 5: Most productive research cities in the world. The diameter is
proportional to the number of publications.
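
To make the propagation and push-back steps concrete, here is a minimal
pure-Python sketch on a toy graph. It is not the implementation used for
GeoDBLP. The function name, the edge-weight dictionary format, the
convergence tolerance, and the toy graph are assumptions; in addition, each
row is renormalized after every multiplication so that it stays a
distribution, which is an assumption on our side rather than a step stated in
the text.

```python
def city_propagation(weights, n_nodes, seeds, n_labels, max_iter=100, tol=1e-9):
    """Propagate seed labels over a weighted undirected graph.

    weights: dict mapping (i, j) with i < j to a positive edge weight.
    seeds:   dict mapping a node index to its known label index.
    Returns one label index per node, or None if no label mass arrived.
    """
    # Adjacency lists stand in for the sparse, symmetric matrix W.
    neighbours = [[] for _ in range(n_nodes)]
    for (i, j), w in weights.items():
        neighbours[i].append((j, w))
        neighbours[j].append((i, w))

    def clamp(Y):
        # Push-back phase: reset rows of known nodes to their one-hot seed.
        for node, label in seeds.items():
            Y[node] = [1.0 if c == label else 0.0 for c in range(n_labels)]

    Y = [[0.0] * n_labels for _ in range(n_nodes)]
    clamp(Y)
    for _ in range(max_iter):
        Y_next = []
        for i in range(n_nodes):
            row = [0.0] * n_labels
            for j, w in neighbours[i]:  # row i of W times Y
                for c in range(n_labels):
                    row[c] += w * Y[j][c]
            total = sum(row)
            # Assumption: renormalize so that each row stays a distribution.
            Y_next.append([v / total for v in row] if total > 0 else row)
        clamp(Y_next)
        delta = max(abs(a - b) for ra, rb in zip(Y, Y_next) for a, b in zip(ra, rb))
        Y = Y_next
        if delta < tol:
            break
    return [max(range(n_labels), key=row.__getitem__) if sum(row) > 0 else None
            for row in Y]


# Toy chain 0 - 1 - 2 - 3 plus an isolated node 4; nodes 0 and 3 are seeds.
toy_labels = city_propagation(
    {(0, 1): 3.0, (1, 2): 1.0, (2, 3): 2.0}, 5, {0: 0, 3: 1}, 2
)
```

In the toy chain, node 1 is pulled toward the label of node 0 through the
heavier edge, node 2 toward the label of node 3, and the isolated node
receives no label at all.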

Although the implementation of CP is just a simple matrix-matrix
multiplication, this already becomes challenging with n around five million.
While the similarity matrix W is very sparse, the labels matrix Y becomes
denser with every iteration, resulting in an almost fully dense matrix if
the graph is completely connected. With 4k+ labels, the labels matrix
already requires more than 160GB with 64-bit floating-point numbers.
Fortunately, one can easily split the labels matrix into chunks and do the
multiplications separately. However, we still require an efficient
implementation for multiplying a sparse matrix with a dense matrix. We
implemented CP with the help of LAMA, a very efficient linear algebra
library. We ran CP for 100 iterations and determined the maximizing label for
every unlabeled node. We used $\lambda_1 = 1$, $\lambda_2 = 3$, and
$\lambda_3 = 2$ as weights. They had been found using a grid search on a
small subset of the data. After running CP, GeoDBLP contains 4,318,206
geo-tagged author-paper-pairs.
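
The memory estimate and the chunking idea can be checked with a short sketch.
The counts of author-paper-pairs and cities are the figures given in the
text; the helper names, the row-dictionary representation of the sparse
matrix, and the chunk size are assumptions for illustration. The sketch only
shows that the plain product can be computed one block of label columns at a
time; it does not reproduce the LAMA-based implementation.

```python
N_PAIRS = 5_033_018    # author-paper-pairs in DBLP, from the text
N_CITIES = 4_254       # distinct cities after entity resolution, from the text
BYTES_PER_FLOAT64 = 8

# Size of a fully dense labels matrix in bytes and in GB (10^9 bytes).
dense_bytes = N_PAIRS * N_CITIES * BYTES_PER_FLOAT64
dense_gb = dense_bytes / 1e9


def sparse_times_dense(rows, Y):
    """Multiply a sparse matrix (list of {column: value} dicts) with dense Y."""
    n_cols = len(Y[0])
    return [[sum(v * Y[j][c] for j, v in row.items()) for c in range(n_cols)]
            for row in rows]


def chunked_product(rows, Y, chunk):
    """Same product, computed one block of label columns at a time."""
    out = [[] for _ in rows]
    for start in range(0, len(Y[0]), chunk):
        block = [y[start:start + chunk] for y in Y]  # only this block is dense
        for i, part in enumerate(sparse_times_dense(rows, block)):
            out[i].extend(part)
    return out
```

Since every label column of the product depends only on the same column of
the labels matrix, the blocks can be processed independently, so only one
block has to be held in memory at a time.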

Looking at the last column in our running example in Fig. 3, we see that
CP fills in the unknown cities, i.e., labels the missing affiliations for
papers 4 and 5. The effect of running CP on our initial dataset is shown in
Fig. 4b. One can see that the number of geo-tagged publications across the
world increases significantly. The geo-locations of publications alone can
already reveal interesting insights, such as the most productive research
cities in the world, see Fig. 5. The main focus of the paper, however, is
the timing of migration.

Figure 6: Individual propensities and (inter-)arrival times illustrated for
the two researchers A1 and A2 of our running example. A researcher's
propensity (shown only for A2) is her probability of migrating. The kth
move propensities are her probability of making k > 1 moves. This should
not be confused with the (inter-)arrival times of the job market, i.e., of
the overall Poisson process. Every node denotes a publication and the node
colors denote different affiliations, i.e., there are three affiliations
here: green, red, and blue. From this, we can read off migration: A1 moves
from green to red, A2 moves from blue to red and from red to green. (Best
viewed in color)

3 Sketching Migration 

Unfortunately, we cannot directly observe the event of transfer from one
residential location or institution to another by a researcher. Instead, we
use the affiliations mentioned in her publication record as a proxy. Even
after city propagation, however, this list may still be noisy and hence does
not readily provide the timing information. For instance, an author may very
well move to a new affiliation and still publish a paper with her old
affiliation because the work was done while being with the old affiliation.
Therefore, we considered migration sketches only. Intuitively, a sketch
captures only the main stations of a researcher's career.

More formally, we define a migration sketch as the set of the unique
affiliations of an author ordered by the first appearance in the list of
publications. For instance, in our running example, we have
[2000: Aff_g, 2002: Aff_r] for author A1 and
[2000: Aff_b, 2001: Aff_r, 2004: Aff_g] for author A2. That is, author A1
has two different affiliations, Aff_g appearing for the first time in 2000
and the first publication with Aff_r in 2002. Of course, this approach
has the drawback that we cannot capture if a person returns to an earlier
affiliation after several years. Finally, we dropped implausible entries
from the resulting sketch database. For instance, we dropped sketches with
more than ten affiliations. It is very unlikely that a single person has
moved more than ten times, and these sketches should rather be attributed to
an insufficient entity disambiguation. Having the migration sketches at hand,
we can now define a migration (or move) of a researcher as the event of
transfer from one residential location to another by a researcher in her
migration sketch. Fig. 6 shows the moves of author A2 in our running example.
In total, we found 310,282 migrations in GeoDBLP. The number of moves per
year is shown in Fig. 7a; it increases super-linearly over the years.
However, when we normalize the number of moves by the number of scientists,
we see a roughly linear slope, see Fig. 7b. With this information at hand,
we can now start to investigate the statistical properties of researcher
migration.

Figure 7: Migration statistics over time in GeoDBLP. The ratio between
moves and authors per year (Fig. 7b) does not grow as fast as the number
of hops (Fig. 7a) or authors (Fig. 2b). Moreover, this illustrates that the
job market is actually an inhomogeneous Poisson process that locally, say
for periods of 10 years, can well be assumed to be homogeneous.
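
The sketch construction and the derived moves can be illustrated as follows.
This is a minimal sketch under stated assumptions: the function names and the
input format (a list of (year, affiliation) pairs per author) are
hypothetical, and the example publication list is made up to mirror author A2
of the running example.

```python
def migration_sketch(publications, max_affiliations=10):
    """Return [(year, affiliation), ...] ordered by first appearance.

    publications: iterable of (year, affiliation) pairs for one author.
    Returns None for sketches with more than max_affiliations entries,
    mirroring the plausibility filter described in the text.
    """
    first_seen = {}
    for year, aff in sorted(publications, key=lambda p: p[0]):
        first_seen.setdefault(aff, year)  # keep only the first appearance
    sketch = [(year, aff) for aff, year in first_seen.items()]
    return sketch if len(sketch) <= max_affiliations else None


def moves(sketch):
    """Return (year, origin, destination) for consecutive sketch entries."""
    return [(y2, a1, a2) for (_, a1), (y2, a2) in zip(sketch, sketch[1:])]


# Hypothetical publication record in the spirit of author A2.
pubs_a2 = [(2000, "blue"), (2001, "red"), (2001, "blue"),
           (2004, "green"), (2005, "red")]
sketch_a2 = migration_sketch(pubs_a2)
moves_a2 = moves(sketch_a2)
```

Note how the later publications with the blue and red affiliations do not
create additional entries, which is the drawback mentioned above: returns to
an earlier affiliation are not captured by a sketch.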

4 Regularities of Timing Events 

As mentioned above, reasons to migrate are manifold. Despite this complex 
web of interactions, we now show that researcher migration shows remark- 
ably simple but strong global regularities in the timing. 
Figure 8: Migration propensity. The individual migration propensity is best
fitted by a log-normal distribution. That is, although jobs enter the market 
all the time, researchers are generally not "memoryless" but have to care 
greatly about their next move, and this timing is a multiplicative function 
of many independently distributed factors. 



4.1 (R1) Migration Propensity is Log-Normal

Given the migration sketches, we can now read off timing information. First,
we estimate the propensity to transfer to a new residential location or
institution across scientists. To do so, let $T_j$ be the point in time when
a researcher moves from one location to the next one. Let $t_j$ be the time
between $T_{j-1}$ and $T_j$. We call $t_j$, i.e., the time between two moves,
the migration propensity (see Fig. 6). It reflects the bias of researchers to
stay for a specific amount of time until moving on.

Fig. 8 shows the best fitting distribution in terms of log-likelihood and
KL-divergence among various distributions such as log-normal, gamma,
exponential, inverse Gaussian, and power-law, using maximum likelihood
estimation for the parameters. It is a log-normal distribution [1, 28]. That
is, the log of the propensity is normally distributed, with density

$$f(x) = \frac{1}{x\,\sigma\sqrt{2\pi}} \exp\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right). \qquad (1)$$

The parameters $\mu$ and $\sigma > 0$ are the mean and the standard
deviation of the variable's natural logarithm. This is a plausible model due
to Gibrat's "law of proportionate effects" [26]. The underlying propensity to
move is a multiplicative function of many independently distributed factors,
such as motivation, open positions, short-term contracts, among others. That
is, such factors do not add together but are multiplied together, as a
weakness in any one factor reduces the effects of all the other factors. That
this leads to log-normality can be seen as follows. Recall that, by the
central limit theorem, the suitably normalized sum of many independent random
variables approaches a normal distribution, largely regardless of the
distribution of the individual variables. Since taking the logarithm turns a
product into a sum, the logarithm of a product of many independent positive
random variables is approximately normal, so the product itself approaches a
log-normal distribution, again largely regardless of the distribution of the
individual factors. This might explain why the log-normal distribution is one
of the most frequently observed distributions in nature and describes a large
number of physical, biological and even sociological phenomena [20]. For
example, variations in animal and plant species just as in incomes appear
log-normal, i.e., normal when presented as a function of the logarithm of the
variable. Dose-response relations just as grain sizes from grinding processes
show log-normal distributions. Moreover, although the overall job market is a
Poisson process, as we will show later on, it is good that the migration
propensity is not exponential. It is precisely this non-Poisson behavior
that makes it possible to make predictions based on past observations. Since
positions are occupied in a rather regular way, upon taking a position it is
very unlikely that you will take up another position soon. In the Poisson
case, which is the dividing case between clustered and regular processes, you
should be indifferent to the time since the last position.

Figure 9: Migration propensities are remarkably similar across continents
and again best fitted by log-normal. Thus, timing research careers has no
cultural boundaries across continents. (Best viewed in color)
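
The model comparison by log-likelihood can be illustrated on synthetic data.
The following is a minimal sketch, not the fitting procedure used for
GeoDBLP: it compares only two of the candidate families (log-normal and
exponential), for which the maximum likelihood estimates have closed forms,
and the sample parameters, sample size, and function names are assumptions.

```python
import math
import random


def lognormal_mle(data):
    """MLE of (mu, sigma): mean and standard deviation of the log-data."""
    logs = [math.log(x) for x in data]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma


def lognormal_loglik(data, mu, sigma):
    """Log-likelihood of the data under a log-normal density."""
    return sum(
        -math.log(x * sigma * math.sqrt(2 * math.pi))
        - (math.log(x) - mu) ** 2 / (2 * sigma ** 2)
        for x in data
    )


def exponential_loglik(data):
    """Log-likelihood under an exponential with the MLE rate 1 / mean."""
    rate = len(data) / sum(data)
    return sum(math.log(rate) - rate * x for x in data)


# Synthetic waiting times with hypothetical parameters, not the GeoDBLP data.
rng = random.Random(0)
sample = [rng.lognormvariate(1.4, 0.5) for _ in range(5000)]
mu_hat, sigma_hat = lognormal_mle(sample)
ll_lognormal = lognormal_loglik(sample, mu_hat, sigma_hat)
ll_exponential = exponential_loglik(sample)
```

On such a sample the log-normal fit attains the higher log-likelihood, which
is the kind of evidence used above to prefer one family over another; a full
comparison would also include the gamma, inverse Gaussian, and power-law
families.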

Based on our data, a computer scientist stays on average 5 years at a
place. Thus headhunters, for example, should approach young talents in
their fourth year. On the other hand, one should probably reconsider the
common practice, e.g. in the EU and the US, of having projects that last only
three years, in order to close this gap. More importantly, the log-normality
of the propensity can be found across continents and countries of the world,
see Figs. 9 and 10, where we considered only moves originating from a
continent or country, respectively. Timing research careers clearly has no
cultural boundaries!

Figure 10: Zooming in on migration propensities across the most productive
countries. The representative countries USA, China, Germany, UK, Australia,
Singapore, Canada, France, Italy, and Hong Kong are shown. Except for China,
all are best fitted by a log-normal distribution; China's migration
propensity follows a gamma distribution. (Best viewed in color)



4.2 (R2) k-th Move Propensities are Gamma 



Fig. 11 shows the best fitting distribution in terms of log-likelihood and
KL-divergence among various distributions such as log-normal, gamma,
exponential, inverse-Gauss, and power-law using maximum likelihood
estimation for the propensity to make k > 1 migrations. More precisely, the
kth move propensity for an author $A_i$ is defined as
$s_i^k = \sum_{j=1}^{k} t_i^j$. It follows a gamma distribution,

$$g(x) = \frac{1}{\Gamma(k)\,\theta^{k}}\, x^{k-1} e^{-x/\theta}, \qquad (2)$$

with shape $k > 0$, scale $\theta > 0$, and
$\Gamma(k) = \int_0^\infty s^{k-1} e^{-s}\, ds$, suggesting that migration
at later career stages is "memoryless". Why? Well, this follows from the
theory of Poisson processes. For Poisson processes, we know that the
inter-arrival times are independent and obey an exponential form,
$\exp(t) = \lambda e^{-\lambda t}$, where $\lambda > 0$ is called the
intensity rate. The important consequence of this is that the distribution
of t conditioned on {t > s} is again exponential. That is, given that we
have not moved to a new position by time s, the remaining waiting time has
the same distribution as the original time t, i.e., it is memoryless.
Moreover, we know that the time until the kth move (the kth move
propensity) has a gamma distribution; it is the sum of the first k
propensities of senior researchers. So, the propensities for the next move
turn exponential for later career stages. This is plausible, since early
career researchers have seldom taken many positions and, hence, we consider
here rather senior researchers, who typically have permanent positions;
they do not have to care greatly about their moves. As a consequence,
competing universities, for example, have to top the current position of a
senior researcher if they want to hire her.

Figure 11: kth move propensities. The kth move propensities (left-right,
top-down with k = 2, 3, 4, 5) are best fitted by gamma distributions. This
suggests that migration at later career stages is "memoryless", i.e., it
follows an exponential distribution.
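
The two facts used in this argument (memorylessness of exponential waiting
times, and the gamma shape of a sum of k such times) can be checked with a
small simulation. The sketch below is illustrative only: the rate of 0.2
moves per year and the helper names are assumptions chosen for the example
and are not estimated from GeoDBLP.

```python
import random


def simulate_waiting_times(rate, k, n, seed=0):
    """Draw n samples of the time until the k-th event of a Poisson process."""
    rng = random.Random(seed)
    return [sum(rng.expovariate(rate) for _ in range(k)) for _ in range(n)]


def conditional_survival(samples, s, t):
    """Empirical P(T > s + t | T > s)."""
    survivors = [x for x in samples if x > s]
    return sum(x > s + t for x in survivors) / len(survivors)


rate = 0.2  # assumed intensity: one move every 5 years on average

# Memorylessness: having already waited 4 years does not change the
# chance of waiting 3 more years.
single = simulate_waiting_times(rate, k=1, n=100_000)
p_fresh = conditional_survival(single, 0.0, 3.0)
p_after_wait = conditional_survival(single, 4.0, 3.0)

# The time until the 3rd move is a sum of 3 exponential times, i.e. gamma
# with shape 3 and scale 1 / rate, so its mean is 3 / rate = 15 years.
third_move = simulate_waiting_times(rate, k=3, n=100_000, seed=1)
mean_third_move = sum(third_move) / len(third_move)
```

Both conditional probabilities come out close to exp(-0.6), while the same
check on a log-normal sample would give clearly different values for the
two cases.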

4.3 Job Market is Poisson Log-Normal 

So far, we have shown that the propensities, let us call them $\delta$, to
move to a new residential location or institute follow a log-normal
distribution. We have also shown that kth move propensities follow a gamma
distribution, suggesting that propensities of senior researchers are
exponential. The latter fact already points towards a Poisson model. More
precisely, we postulate that the job market follows a Poisson-log-normal
model [28]. That is, given a specific scientist's migration propensity
$\delta$, her probability of migrating follows a simple Poisson model:
$\mathrm{pos}(k) = \frac{1}{k!}\,\delta^{k} e^{-\delta}$, for
k = 1, 2, 3, 4, .... Thus the rate of the Poisson process is a function of
the migration propensity. The number of migrations for all scientists
having the same $\delta$ value will follow the same Poisson process.
Moreover, since the sum of Poisson processes is again a Poisson process, we
know that every finite sample of scientists with $\delta$s drawn from a
log-normal is again following a Poisson process. Thus, assuming the job
market to be a Poisson model is plausible. It actually tells us that the
arrival of job openings is memoryless. Open positions should always be
announced as they come. On a global scale, there is no point in waiting to
announce them. There are always researchers ready to take them. And
individual researchers can always look out for new job openings.

Figure 12: (Left) Brain circulation follows a gamma distribution. (Right)
The inter-city migration frequency follows a power-law (after removing
low-frequency connections).
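
A Poisson-log-normal model of this kind can be sketched as a two-stage
sampler: first draw a propensity from a log-normal, then draw a migration
count from a Poisson distribution with that propensity as its rate. The
code below is a minimal sketch under assumed parameters (mu = 0,
sigma = 0.5) and with hypothetical function names; it uses the standard
Poisson probability mass function and is not the estimation procedure of
the paper.

```python
import math
import random


def poisson_pmf(k, delta):
    """Standard Poisson probability of k events at rate delta."""
    return delta ** k * math.exp(-delta) / math.factorial(k)


def sample_poisson(delta, rng):
    """Count exponential inter-arrival times (rate delta) within one time unit."""
    count, elapsed = 0, rng.expovariate(delta)
    while elapsed <= 1.0:
        count += 1
        elapsed += rng.expovariate(delta)
    return count


def sample_poisson_lognormal(mu, sigma, n, seed=0):
    """Draw n migration counts, each with its own log-normal propensity."""
    rng = random.Random(seed)
    return [sample_poisson(rng.lognormvariate(mu, sigma), rng) for _ in range(n)]


mu, sigma = 0.0, 0.5  # assumed log-normal parameters, for illustration only
counts = sample_poisson_lognormal(mu, sigma, n=50_000)
mean_count = sum(counts) / len(counts)
var_count = sum((c - mean_count) ** 2 for c in counts) / len(counts)
expected_mean = math.exp(mu + sigma ** 2 / 2)  # mean of the log-normal
```

The mean count matches the mean propensity, while the variance exceeds the
mean. This overdispersion is what distinguishes the mixture over many
scientists from a single Poisson distribution with one shared rate.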

4.4 (R3) Brain Circulation is Gamma

Brain circulation, more widely known as brain drain, is the term
generically used to describe the mobility of high-level personnel. It is an
emerging global phenomenon of significant proportion as it affects the
socio-economic and socio-cultural progress of a society, a nation, and the
world. Here, we define it as the time until a researcher returns to the
country of her first publication. Only 29,398 out of 193,986 (15%) mobile
researchers, i.e., researchers that have moved at least once, returned to
their roots (in terms of publications); this is 3% of the total of
1,080,958 researchers. As to be expected from the statistical regularity
for kth move propensities, it also follows a gamma distribution, as shown
in Fig. 12 (left). Since a gamma distribution arises as the distribution of
a sum of exponentially distributed times, returning is memoryless.
Researchers cannot plan to return to their roots but rather have to pick up
opportunities as they arrive.

5 Link Analysis of Migration 

Link analysis techniques provide an interesting alternative view on our
migration data. That is, we view migration as a graph where nodes are
cities and directed edges are migration links between cities. More
formally, the author-migration graph is a directed graph $G = (V, E)$ where
each vertex $v \in V$ corresponds to a city in our database. There is an
edge $e \in E$ from vertex $v_1$ to vertex $v_2$ iff there is an author who
has moved from an affiliation in city $v_1$ to $v_2$.
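
Constructing such a graph from per-author city sequences can be sketched as
follows. The input format (a mapping from an author identifier to a
chronological list of cities) and the function name are assumptions made
for this illustration; the edge counts additionally record how many moves
were observed between each ordered pair of cities.

```python
from collections import Counter


def build_migration_graph(careers):
    """Return a Counter mapping (source_city, target_city) to number of moves.

    careers: dict mapping an author id to the chronological list of cities
    of her affiliations (hypothetical input format).
    """
    edges = Counter()
    for cities in careers.values():
        for src, dst in zip(cities, cities[1:]):
            if src != dst:  # staying in the same city is not a migration
                edges[(src, dst)] += 1
    return edges


# Toy example with placeholder city names.
careers = {
    "author_1": ["A", "B", "C"],
    "author_2": ["A", "B", "B"],
    "author_3": ["C"],
}
edges = build_migration_graph(careers)
vertices = {city for edge in edges for city in edge}
```

Here the edge ("A", "B") has count 2 and ("B", "C") has count 1, while
author_3 contributes no edge because she never moved.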

5.1 (R4) Inter-City Migration is Power-Law 

Triggered by Zipf's early work and other recent work on inter-city
migration [3U El E7], we investigated the frequency of inter-city
researcher migration. The frequency of a connection between two cities can
be seen as the knowledge exchange rate between the cities. It is a kind of
knowledge flow because one can assume that researchers take their acquired
knowledge to their next affiliation. If one looks at the author-migration
graph as a traffic network, highly frequent connections correspond to
heavily used streets. Fig. 12 (right) shows the distribution with a fitted
power-law using maximum likelihood estimation. A likelihood comparison to
other distributions such as log-normal and gamma revealed that a power-law
is the best fit. Thus, there are only few pairs of cities with frequent
researcher exchange and many low-frequency pairs. However, cities with a
high exchange of researchers will exchange even more researchers in the
future. Investments into migration pay off.
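
For illustration, a maximum likelihood fit of a power-law exponent can be
sketched with the well-known closed-form estimator for the continuous case,
alpha = 1 + n / sum(ln(x_i / x_min)). This is an assumption on our side:
the text does not specify which estimator or which cut-off x_min was used,
and migration frequencies are discrete, so the continuous formula is only
an approximation. The function names and the synthetic data are
hypothetical.

```python
import math
import random


def fit_power_law_alpha(values, x_min):
    """Continuous power-law MLE of the exponent for values >= x_min."""
    tail = [x for x in values if x >= x_min]
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)


def sample_power_law(alpha, x_min, n, seed=0):
    """Inverse-transform sampling from p(x) ~ x^(-alpha) for x >= x_min."""
    rng = random.Random(seed)
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(n)]


# Synthetic frequencies with a known exponent (not data from the paper).
true_alpha, x_min = 2.5, 1.0
frequencies = sample_power_law(true_alpha, x_min, n=50_000)
alpha_hat = fit_power_law_alpha(frequencies, x_min)
```

Dropping all values below x_min before fitting mirrors the removal of
low-frequency connections mentioned in the caption of Fig. 12.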

5.2 (SP 1) Migration Authorities and Hubs 

Next, we are interested in mining the migration authorities and hubs. To do
so, we use Kleinberg's HITS algorithm [14] on the author-migration graph.
The algorithm is an iterative power method and returns two scores for every
node in the graph, which are known as hub and authority scores. This
terminology arises from the web, where hubs and authorities represent
websites. Hubs are pages with many outlinks and authorities are pages with
many inlinks.
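
A minimal sketch of the HITS iteration on a weighted, directed edge list is
given below. It assumes that edges are weighted by the number of observed
moves and that 50 iterations suffice; both are choices made for this
illustration, and whether the original analysis used weighted or
unweighted edges is not stated. The toy graph uses placeholder city names.

```python
import math


def _normalized(scores):
    """Scale a score dictionary to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {k: v / norm for k, v in scores.items()}


def hits(edges, iterations=50):
    """Return (hub, authority) scores for a dict {(src, dst): weight}."""
    nodes = {city for edge in edges for city in edge}
    hub = {city: 1.0 for city in nodes}
    auth = {city: 1.0 for city in nodes}
    for _ in range(iterations):
        # Authority update: sum of hub scores of cities sending researchers.
        auth = dict.fromkeys(nodes, 0.0)
        for (src, dst), weight in edges.items():
            auth[dst] += weight * hub[src]
        auth = _normalized(auth)
        # Hub update: sum of authority scores of cities receiving them.
        hub = dict.fromkeys(nodes, 0.0)
        for (src, dst), weight in edges.items():
            hub[src] += weight * auth[dst]
        hub = _normalized(hub)
    return hub, auth


# Toy graph: A sends to B and C, B sends to C.
toy_edges = {("A", "C"): 3, ("B", "C"): 2, ("A", "B"): 1}
hub, auth = hits(toy_edges)
top_hub = max(hub, key=hub.get)
top_authority = max(auth, key=auth.get)
```

In the toy graph, A obtains the largest hub score and C the largest
authority score, while C, which sends no one, has a hub score of zero.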

In our context, an inlink corresponds to a researcher arriving in a city
(she picks up a new position), whereas an outlink corresponds to a
researcher leaving a city, e.g. because funding ends. Hubs can be seen as
"sending" cities, i.e., they send out researchers across the world. On the
other hand, authorities can either be cities where people want to stay and
tenured positions are available or where people drop out of research, i.e.
head to industry. They are "receiving" cities. Moreover, if we make the
assumption that only high-quality students and scientists get new
positions, one may view sending cities as institutions that produce
high-profile scientists but cannot hold all of them, due to restricted
capacities or low attractiveness. In contrast, receiving cities might have
the capacities and reputation to hold many migrating researchers, or highly
interesting industrial jobs are close by. Cities having generally high
scores are incubators: they attract a lot of migrating researchers but also
send them to other places.

Figure 13: (left and middle) Running HITS on the directed author-migration
graph reveals sending, receiving, and incubator countries. Shown are
representative cities in North America (left) and Europe (middle). The size
of spikes encodes the value of the "authority" (blue) and "hub" (red)
scores. Incubator cities have well balanced scores. As one can see, the
European cities rather send researchers. US cities at the east coast are
incubators, and west coast cities receive researchers. (right) Top 25
migration cities ranked by PageRank. Compared to the productivity map in
Fig. 5, one can see that productive cities are not necessarily cities with
high migration flux. (Best viewed in color)



Fig. 13 shows the sending and receiving scores for cities in the
representative regions of the US and Europe.^9 The US clearly shows an
east-coast/west-coast movement. The east coast aggregates many sending
cities while receiving cities dominate the west coast. This is plausible.
Not only are there many highly productive universities on the west coast,
see also Fig. 5, but the labor market for high-tech workers in, say, the
Bay Area is the strongest in a decade. Thousands of new positions are being
offered by small startups and established tech giants. However, one should
view many of the east-coast cities as incubators since they have high
overall scores. The scores of European cities are typically much smaller,
see again Fig. 13 (middle). Europe is dominated by sending cities. The few
exceptions are Berlin, Munich, Stockholm, and Zurich. The largest receiving
city in the world is Singapore. This is also plausible. The city-state is
known for its remarkable investment in research in recent years, as e.g.
noted by a recent Nature Editorial [7].^10 In contrast, the largest sending
city by far is Beijing. This is also plausible. There has been an upsurge
in Chinese emigration to Western countries since the middle of the first
decade of the 21st century [15]. In 2007, China became the biggest
worldwide contributor of emigrants.

9 Rendered with WebGL Globe (see http://www.chromeexperiments.com/globe).

5.3 (SP2) Moving Cities 

Following up on HITS, we also computed PageRank on the author-migration
graph. Compared to HITS, PageRank [23] produces only a single score: a
page is informative or important if other important pages point to it. More
formally, by converting a graph to an ergodic Markov chain, the PageRank
of a node v is the (limit) stationary probability that a random walker is
at v. In the context of migration, this has a natural and very appealing
analogy. The PageRank computes the (limit) stationary probability that a
random migrator is at a city.

To compute the MigrationRank of a city, the author-migration graph is
transformed into the PageRank matrix on which a power method is applied
to obtain the PageRank vector, containing a score for every node in the
graph. The transformed matrix also contains the stochastic adjustment
identical to the random surfer in the original work. That is, a researcher
can always migrate from one affiliation to another affiliation, even if no
one else did so before. Fig. 13 (right) shows the top 25 cities in the
world according to the MigrationRank. Compared to the productivity map in
Fig. 5, one can clearly see many similarities but also notable differences.
The US is not only productive but thrives on migration. Vancouver, B.C., is
among the top 25 when it comes to migration but not when it comes to
productivity. Generally, productivity does not imply a high migration rank.
Beijing, however, is top in both productivity and migration. Singapore is
ranked higher for migration than for productivity. European cities seem to
also thrive on migration more than on productivity. At least there are many
more European cities in the top 25 for migration than for productivity.
However, compared to the US, they are less clustered together.
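
The MigrationRank computation described above can be sketched as a power
iteration with a random-jump adjustment. The damping factor of 0.85 is the
conventional PageRank default and is an assumption here, as are the
function name, the weighting of edges by number of moves, and the uniform
redistribution of rank from cities without outgoing moves. The toy graph
uses placeholder city names.

```python
def migration_rank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank for a dict {(src, dst): weight}."""
    nodes = sorted({city for edge in edges for city in edge})
    n = len(nodes)
    out_weight = dict.fromkeys(nodes, 0.0)
    for (src, _dst), weight in edges.items():
        out_weight[src] += weight

    rank = dict.fromkeys(nodes, 1.0 / n)
    for _ in range(iterations):
        # Rank held by cities without outgoing moves is spread uniformly.
        dangling = sum(rank[c] for c in nodes if out_weight[c] == 0.0)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = dict.fromkeys(nodes, base)
        # A random migrator follows an observed move with prob. `damping`.
        for (src, dst), weight in edges.items():
            new_rank[dst] += damping * rank[src] * weight / out_weight[src]
        rank = new_rank
    return rank


# Toy graph: A sends to B and C, B sends to C, C sends back to A.
toy_edges = {("A", "C"): 3, ("B", "C"): 2, ("A", "B"): 1, ("C", "A"): 1}
rank = migration_rank(toy_edges)
ranking = sorted(rank, key=rank.get, reverse=True)
```

The scores sum to one and can be read as the long-run share of time a
random migrator spends in each city; in the toy graph the ordering is
C, A, B.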



10 The recent economic pressure mounting on research communities in
Singapore and around the world is not well captured in our data, which
extends to 2010 only.

Figure 14: Prototypical migration career of a computer scientist according
to the WWW. Shown are the mean values for (kth move) propensities and
brain circulation. That is, on average a scientist makes the next move
after 5 years (green). Making two moves takes on average 8 years (red), and
three moves 11 years (red). She moves back to her roots, if at all, after
8 years (blue). (Best viewed in color)

6 Conclusions 

International mobility among researchers not only benefits the individual 
development of scientists, but also creates opportunities for intellectually 
productive encounters, enriching science in its entirety, preparing it for the 
global scientific challenges lying ahead. Moreover, mobile scientists act as 
ambassadors for their home country and, after their return, also for their 
former host country, giving mobility a culture-political dimension. So far, 
however, no statistical regularities have been established for the timing of 
migration. In this paper, we have established the first set of statistical 
regularities and patterns for research migration stemming from inferring 
and analyzing a large-scale, geo-tagged dataset from the web representing 
the migration of all researchers listed in DBLP. The methods and findings 
highlight the value of using the World Wide Web together with data mining 
to fill in missing data as a world-wide lens onto research migration. 

Specifically, we described the creation of GeoDBLP that, in contrast to
existing migration research, involved propagation of only few seed
locations across bibliographic data, namely the DBLP network of authors and
papers. The result was a database of over 5 million unique author-paper
pairs mostly labeled with geo-tags, which was used for a detailed
statistical analysis. The statistical regularities and patterns discovered
are encouraging: we could estimate statistical regularities for migration
propensities that align well with but actually go beyond knowledge in the
migration and scientometric literature (typically concluded from
small-scale, unregistered data only) and establish for the first time that
there are no cultural boundaries for the timing events underlying
migration. The statistical regularity remains similar no matter what
country you are looking at. Thus, moving on to a new position is a common
pattern in terms of timing across different countries, from the US to
China, Germany, and Australia, and independent of geography, ideology,
politics or religion. The resulting prototypical migration career is
sketched in Fig. 14. This is interesting, since, if nations want to get
back their high-level personnel, they have to do that just before the
second move, on average in the 7th year. Otherwise, it is likely that the
high-level personnel will not come back anymore. And recall that only 3%
of all scientists actually return. If you miss this opportunity, you will
have to invest much more, since moving in later stages of a career is
memoryless; there is no pressure for high-level personnel to move. On
average scientists move every 5 years. This high value is due to the
dominance of researchers in early academic career stages. For senior
scientists, who are the minority, this turns into a gamma distribution. For
instance, we make two moves within 8 years on average, while making three
moves takes on average 11 years. Analyzing the author-migration graph
reveals for instance that China is the largest migration hub in the world,
whereas Singapore is the largest migration authority. Generally, the east
coast of the US receives and sends out researchers; it is an incubator. In
contrast, the west coast of the US is a large migration authority, probably
due to a strong new economy and better climate. People have had this
suspicion, but we are showing on a very large scale that these insights go
beyond folklore.

In general, our findings suggest that the WWW, together with data 
mining to deal with missing information, may complement existing migra- 
tion data sources, resolve inconsistencies arising from different definitions 
of migration, and provide new and rich information on migration patterns 
of computer scientists. However, a lot remains to be done. One should 
monitor migration over time and validate gravity models for international 
migration. One should also investigate the distribution over distances
traveled when migrating. It is certainly more complex and most likely
follows a mixture of distributions. Initial results show that there are
several modes, indicating that there are cultural boundaries. Other
interesting avenues for
future work are geographical topic models to discover research trends across 
the world and to realize expert finding systems that know where the ex- 
perts are at any time. The most promising direction is to extend our results 
beyond computer science. 

Nevertheless, our results are an encouraging sign that harvesting and
inferring data from the web at large scale may give fresh impetus to
demographic research; we have only started to look through the world-wide
web lens onto it.